Hesitation in speech can . . . um . . . help a listener understand

Authors

  • Martin Corley
  • Robert J Hartsuiker
Abstract

This paper investigates the effect of disfluencies on listeners’ on-line processing of speech. More specifically, it tests the hypothesis that filled pauses like um, which tend to occur before words that are low in accessibility, act as a signal to the listener that a relatively inaccessible word is about to be produced. Two experiments are reported, in which participants followed recorded instructions to press buttons corresponding to images on a computer screen. In 50% of trials, the spoken name of the image was preceded by um. In experiment 1, the intrinsic accessibility of the target items was manipulated (by means of lexical frequency); in experiment 2, the extrinsic (visual) accessibility varied. Both experiments demonstrated that participants were quicker to respond when a target was preceded by um, regardless of whether the item referred to was difficult to access or not. In addition, in experiment 2 there was a weak interaction between accessibility and presence or absence of an um. We present the data here as early evidence that listeners can benefit from disfluencies in others’ speech, and outline some methodological and theoretical considerations and further experiments.

By far the most common kind of language in use is conversation (Clark & Wilkes-Gibbs, 1987). In conversation, utterances are produced spontaneously. That is, they are “conceived and composed by their speakers even as they are spoken” (Mehta & Cutler, 1988, p. 136). A consequence of this is that spontaneous speech contains disfluencies. These are generally defined as “phenomena that interrupt the flow of speech and do not add propositional content to an utterance” (Fox Tree, 1995, p. 709). They include pauses, interruptions (midphrase or midword), repeated words and phrases, restarted sentences, words with elongated pronunciations, such as the pronounced as /Di:/ and a as /Ei:/, and filled pauses such as uh and um.
Such disruptions are very frequent: averaging across a number of studies, and excluding silent hesitations, Fox Tree (1995) estimated that the rate of disfluencies in spontaneous speech is about 6 words per 100 (see also Bortfeld et al., 2001).

Despite the many disfluencies that occur in spontaneous speech, most studies of the comprehension of spoken language have used idealised, fluent utterances. This owes much to the commonly held view that disfluencies are noise and present obstacles to comprehension (Brennan & Schober, 2001, p. 275). However, some researchers have argued that disfluencies do not constitute “noise” at all, but are actually informative to listeners: they may provide information about the state of speakers’ production systems. Specifically, certain disfluencies signal to listeners that speakers are experiencing production difficulty. Difficulty can occur at any stage of the process (during planning, lexical retrieval, or the articulation of a speech plan), and it has been argued that different types of disfluency signal different kinds of problems (e.g., Bortfeld et al., 2001).

To date, much of the evidence supporting this account of conversational disfluencies has come from corpus studies of filled pauses such as uh, um, the as /Di:/, and oh (e.g., Clark & Fox Tree, 2002; Fox Tree & Clark, 1997; Fox Tree & Schrock, 1999), and from experimental evidence gathered from speakers. For example, when asked to answer general knowledge questions, speakers tend to produce more uhs and ums before answers they are unsure of (Brennan & Williams, 1995; Smith & Clark, 1993). Moreover, uh appears to signal a shorter upcoming pause (and by inference, a less severe retrieval problem) than um (Smith & Clark, 1993), a finding borne out by corpus analyses (Clark & Fox Tree, 2002).

A number of studies examine in more detail the circumstances that might lead to a problem with retrieval.
For example, unpredictable lexical items are preceded by hesitations more often than those that are predictable (Beattie & Butterworth, 1979). There is also a well-established correlation between disfluency and lexical frequency. For example, Maclay and Osgood (1959) examined a sample of spontaneous speech and found that “pauses filled with er and the like” were more likely to occur before open-class than (high frequency) closed-class words; Levelt (1983) showed that the frequency of colour names correlated negatively with the probability that these would be preceded by filled pauses.

However, findings concerning production do not necessarily imply that disfluencies are somehow designed to inform listeners about the states of speakers’ production systems: they could simply be a by-product of the speech production process. Moreover, and of direct relevance to the current paper, they provide no evidence that listeners can or do exploit the information provided by disfluencies. For evidence of this kind, we turn to studies in which the focus is on the listener rather than the speaker.

Much of the reported evidence for listener sensitivity to disfluency comes from studies in which listeners are asked to compare or rate utterances. For example, Brennan and Williams (1995) presented participants with recorded answers to general knowledge questions, and asked them to estimate “how likely it was that the speaker knew the correct answer” (p. 389). Ratings were negatively affected by pauses before responses (as well as by the length of these pauses); but additionally, answers preceded by uh or um were judged less likely to be correct than answers preceded by unfilled (silent) pauses of the same lengths. Howell and Young (1991) found that listeners rated utterances including repairs as more comprehensible when those repairs were preceded by pauses. These studies, however, use off-line tasks (i.e., they measure comprehension after processing is complete).
An important assumption underlying comprehension research is that comprehension takes place on-line (in “real time”: Marslen-Wilson & Tyler, 1980). Assuming that disfluencies convey information, rather than noise, what we ultimately want to know is whether listeners can benefit from that information as they comprehend a given utterance. Evidence that some disfluencies can immediately facilitate the comprehension of words that follow them comes from a study by Fox Tree (2001). Listeners identified words from recordings of speech with spontaneous uhs either present or digitally excised. Target words were recognised faster with the uhs present. Fox Tree concluded that uhs heighten attention to the speech that is to follow (cf. Fox Tree & Schrock, 1999, for oh). Similarly, Brennan and Schober (2001) found that compared to fluent controls, between-word interruptions (yellow-orange), and mid-word interruptions with or without fillers (yell-uh-orange, yell-orange) led to quicker identification of the “correct” (repair) word. The quickest identifications were in cases where the interruption included a filler.

These studies, however, have a number of shortcomings. Fox Tree’s materials may have had more natural prosodies with the uhs present; and the experimental task (word identification) is still some way from natural, on-line, language comprehension. Brennan and Schober (2001) did use a more natural task (following instructions referring to objects) but in their study the interruption itself reduced the potential number of referents. Participants in their study viewed a display with two objects, one of which was the target. When the naming of one was interrupted, it was immediately clear that the other was the target, thus enabling participants to respond fast. This artefact, rather than the repairs per se, may account for their findings.
At best, the on-line studies above provide only weak evidence for the proposition that listeners can exploit such information as might be conveyed by disfluencies. This is because the information content in each study is low: if disfluencies are supposed to signal problems in access for the speaker, the listener needs to be able to judge which parts of an utterance (here, references to objects or concepts) are likely to be difficult. In the above studies, accessibility is not manipulated: all referents are equally accessible or inaccessible.

To date, two studies have directly manipulated accessibility in tests of listeners’ sensitivity to disfluency. Barr (2001) and Arnold, Fagnano, and Tanenhaus (2003) manipulated target words in terms of accessibility with respect to the discourse model. It is a common finding that newly introduced items are harder to access than (recently mentioned) items already in the discourse (e.g., Arnold, Wasow, Ginstrom, & Losongco, 2000), because new information has a lower expectancy. Barr (2001) presented listeners with sentences describing abstract shapes that were either familiar or new to them. Their task was to point to the shape that matched the description they were hearing. When the shapes constituted new information, listeners’ responses were faster when descriptions were preceded by an um than when they were preceded by random noise.

Arnold et al. (2003) conducted a study in which participants’ eye movements were recorded while they viewed a series of displays of four objects, two of which began with the same phonological segments (e.g., candle and camel). Participants were instructed to move one of these two competitor objects, which had either been established as discourse-new or discourse-given; a proportion of instructions contained disfluencies. Arnold et al.
found that, regardless of the content of the instructions, more initial fixations were made on the discourse-given competitor when the instruction was fluent, whereas a disfluent instruction led to more initial fixations on the discourse-new object.

Although the studies by Arnold et al. (2003) and Barr (2001) are highly suggestive, they still leave some questions open. The first is that of the types of accessibility information which listeners can make use of when they encounter disfluent speech. It is well established that disfluency can result from language-internal, or intrinsic, accessibility difficulties (for example, speakers are more likely to be disfluent when naming low-frequency colours; Levelt, 1983). Equally, when accessibility is extrinsic, that is, manipulated independently of language, speakers are likely to be disfluent. For example, speakers are likely to use more filled pauses when describing ambiguous pictures (Siegman & Pope, 1966). The distinction between the two types of accessibility is important, because it is only in the intrinsic case that listeners must effectively be able to model the production process.

The second question is that of what it is about a disfluent utterance that cues listeners to use the information. Barr (2001) and Arnold et al. (2003) used naturalistic recordings of spontaneous speech as auditory stimuli in their experiments. Although there are clear advantages to this approach in terms of ecological validity, it still leaves open the question of whether something other than a filled pause (such as, say, the prosody of an entire utterance, as suggested by Arnold et al., 2003) is acting to cue the listener to pay attention to a given object.
In the experiments reported below, we aim to explore listeners’ sensitivity to intrinsic (experiment 1) and extrinsic (experiment 2) accessibility information, in the face of speaker disfluency, by manipulating the lexical frequency (experiment 1) and visual accessibility (experiment 2) of items referred to in spoken instructions. In contrast to the studies reported above, we will use digitally edited recordings, with the only difference between fluent and disfluent utterances being the presence or absence of an um. If listeners are sensitive to disfluency and can make use of the appropriate type of accessibility information, we expect them, in each experiment, to respond faster to fluent instructions when these mention accessible items, but faster to disfluent instructions when these refer to inaccessible items.
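
The design described above crosses two factors, accessibility and fluency, with um present on half of the trials. As a rough illustration only, the following Python sketch builds such a balanced trial list and encodes the predicted pattern. The factor labels, item names, number of items per cell, and function names are all assumptions made for this sketch and are not taken from the experiments themselves.

```python
import itertools
import random

# Hypothetical factor levels; the labels are illustrative, not the authors' own.
ACCESSIBILITY = ("accessible", "inaccessible")
FLUENCY = ("fluent", "disfluent")  # "disfluent" = target name preceded by "um"


def build_trials(items_per_cell, seed=0):
    """Return a shuffled trial list with the same number of trials in each cell."""
    rng = random.Random(seed)
    trials = [
        # Item names are placeholders; real stimuli would be pictures and recordings.
        {"accessibility": acc, "fluency": flu, "item": f"{acc}_{flu}_{i}"}
        for acc, flu in itertools.product(ACCESSIBILITY, FLUENCY)
        for i in range(items_per_cell)
    ]
    rng.shuffle(trials)
    return trials


def predicted_faster(accessibility):
    """Fluency condition the hypothesis predicts to be faster at this level."""
    return "fluent" if accessibility == "accessible" else "disfluent"


trials = build_trials(items_per_cell=4)  # 4 per cell is an arbitrary choice
um_share = sum(t["fluency"] == "disfluent" for t in trials) / len(trials)
print(len(trials), um_share)
```

Because every cell holds the same number of trials, exactly half of the list is disfluent, which matches the 50% figure given in the abstract; the prediction function simply restates the expected crossover between the two factors.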

Related resources

Towards a Grammar of Spoken Language

So-called ill-formed utterances abound in natural discourse. This paper describes how listeners can understand such utterances in a discourse, including speech hesitations and slips-of-the-tongue. Through analysis of durations of utterances and their succeeding pauses, and of intonation patterns, and also through listener understanding experiments using ill-formed utterances, the following resu...

Hesitation disfluencies in spontaneous speech:

Human speech is peppered with ums and uhs, among other signs of hesitation in the planning process. But are these so-called fillers (or filled pauses) intentionally uttered by speakers, or are they side-effects of difficulties in the planning process? And how do listeners respond to them? In the present paper we review evidence concerning the production and comprehension of fillers such as um a...

The Information Value of Some Hesitation Phenomena: Filled Pauses, Lengthenings, and Entropy Reduction

Such hesitation phenomena as filled pauses (e.g., uh, um) have been argued to serve a pragmatic role as markers of impending hesitation on the part of the speaker (Smith and Clark, 1993; Clark and Fox Tree, 2002). Lengthenings (e.g., a:nd, we:ll)—also hesitation phenomena—are arguably similar. In short, these hesitation markers constitute information from which listeners make inferences. The pr...

A study of some acoustic characteristics of infant-directed speech in Persian-speaking mothers

Introduction: When adults talk to another person, linguistic characteristics of the listener will also be considered. A clear example of speech changes depending on the listener is maternal or infant directed speech. Infant directed speech is more slowly with longer sentences and pauses at the end of the utterance. Undoubtedly the most distinctive feature of this style of speech is acoustic c...

Phone Elasticity in Disfluent Contexts

Disfluencies in speech are instances of hesitation or correction that affect both the speaker and the listener. Typical surface forms of disfluencies are filled pauses (such as uh, uhm), silences at places in the utterance where syntax would not predict them, or repetitions of parts of the utterance (like (I mean + I mean) we should go). Disfluencies carry a folk notion of erroneousness or badn...

This Version May Not Be Identical to the Published Version

Prosody is one of the most undervalued components of language, despite fulfilling manifold purposes: it can, for instance, help assign the correct meaning to compounds such as ‘white house’ (linguistic function), or help a listener understand how a speaker feels (emotional function). However, brain based models that take into account the role prosody plays in dynamic speech comprehension are st...

Publication date: 2003